Wisteria: Nurturing Scalable Data Cleaning Infrastructure
Authors
Abstract
Analysts report spending upwards of 80% of their time on problems in data cleaning. The data cleaning process is inherently iterative, with evolving cleaning workflows that start with basic exploratory data analysis on small samples of dirty data, then refine the analysis with more sophisticated or expensive cleaning operators (e.g., crowdsourcing), and finally apply the insights to a full dataset. While an analyst often knows at a logical level what operations need to be done, they must still navigate a large search space of physical operators and parameters. We present Wisteria, a system designed to support the iterative development and optimization of data cleaning workflows, especially ones that utilize the crowd. Wisteria separates logical operations from physical implementations and, driven by analyst feedback, suggests optimizations of and/or replacements for the analyst’s choice of physical implementation. We highlight research challenges in sampling, in-flight operator replacement, and crowdsourcing. We give an overview of the system architecture and these techniques, then provide a demonstration designed to showcase how Wisteria can improve iterative data analysis and cleaning. The code is available at: http://www.sampleclean.org.
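To make the separation of logical operations from physical implementations concrete, here is a minimal sketch in Python. It is not Wisteria's API: the names `LogicalOp`, `dedup_exact`, `dedup_similarity`, and `run_on_sample` are hypothetical, and the Jaccard-based deduplication is just one assumed example of a costlier physical operator. The sketch only shows trying an operator on a sample, swapping the implementation, and then running on the full data; it does not model in-flight replacement or crowdsourcing.

```python
import random
from typing import Callable, Dict, List

Record = str
# A physical operator takes a list of records and returns the cleaned list.
PhysicalOp = Callable[[List[Record]], List[Record]]


def dedup_exact(records: List[Record]) -> List[Record]:
    """Cheap operator: drop duplicates after trimming and lowercasing."""
    seen, out = set(), []
    for r in records:
        key = r.strip().lower()
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def dedup_similarity(records: List[Record], threshold: float = 0.5) -> List[Record]:
    """Costlier operator: drop a record if it is similar to one already kept."""
    out: List[Record] = []
    for r in records:
        if all(jaccard(r, kept) < threshold for kept in out):
            out.append(r)
    return out


class LogicalOp:
    """A logical operation bound to one of several physical implementations."""

    def __init__(self, name: str, implementations: Dict[str, PhysicalOp], choice: str):
        self.name = name
        self.implementations = implementations
        self.choice = choice

    def replace(self, choice: str) -> None:
        """Swap the physical implementation; the logical plan is unchanged."""
        if choice not in self.implementations:
            raise KeyError(choice)
        self.choice = choice

    def run(self, records: List[Record]) -> List[Record]:
        return self.implementations[self.choice](records)


def run_on_sample(op: LogicalOp, records: List[Record], fraction: float, seed: int = 0) -> List[Record]:
    """Run the operator on a uniform random sample of the records."""
    rng = random.Random(seed)
    k = max(1, int(len(records) * fraction))
    return op.run(rng.sample(records, k))


data = ["Cafe Roma", "cafe roma ", "Cafe Roma Restaurant", "Blue Bottle"]
dedup = LogicalOp(
    "deduplicate",
    {"exact": dedup_exact, "similarity": dedup_similarity},
    choice="exact",
)
preview = run_on_sample(dedup, data, fraction=0.5)  # explore on a small sample
dedup.replace("similarity")                          # swap the physical operator
cleaned = dedup.run(data)                            # apply to the full dataset
```

In this sketch the workflow keeps referring to the logical `deduplicate` step, so changing the physical operator does not require rewriting the rest of the plan.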
Related resources
A Software Infrastructure for the CLEENEX Optimizer
The problems associated with data quality are a growing concern. Throughout this document we will focus on a specific data quality problem: the existence of approximate duplicate records. Data cleaning aims at correcting data quality problems that can be found in various situations. There are some data cleaning tools that address these data quality problems. One of the tasks of a dat...
A New Framework for Increasing the Sustainability of Infrastructure Measurement of Smart Grid
Advanced Metering Infrastructure (AMI) is one of the most significant applications of the Smart Grid. It is used to measure, collect, and analyze data on power consumption. In the AMI network, smart meter traffic is aggregated at intermediate aggregators and forwarded to the Meter Data Management System (MDMS). The infrastructure used in this network should be reliable, real-time an...
Horticulture, hybrid cultivars and exotic plant invasion: a case study of Wisteria (Fabaceae)
Exotic Wisteria species are highly favoured for their horticultural qualities and have been cultivated in North America since the early 1800s. This study determines the identity, genetic diversity and hybrid status of 25 Asian Wisteria cultivars using plastid, mitochondrial and nuclear DNA data. Fifteen (60%) hybrid cultivars were identified. All of the ‘Wisteria sinensis’ cultivars sampled are...
Statistical Distortion: Consequences of Data Cleaning
We introduce the notion of statistical distortion as an essential metric for measuring the effectiveness of data cleaning strategies. We use this metric to propose a widely applicable yet scalable experimental framework for evaluating data cleaning strategies along three dimensions: glitch improvement, statistical distortion and cost-related criteria. Existing metrics focus on glitch improvemen...
Bi-parental cytoplasmic DNA inheritance in Wisteria (Fabaceae): evidence from a natural experiment.
Cytoplasmic inheritance was investigated in interspecific hybrids of Wisteria sinensis and W. floribunda. Species-specific nuclear, mitochondrial and plastid DNA markers were identified from wild-collected plants of each species in its native range. These markers provide evidence for the bi-parental transmission of plastids in hybrid swarms of these two species in the southeastern USA. These po...
Journal:
- PVLDB
Volume: 8
Publication date: 2015